Abstract

This paper presents a novel method of generating 
and applying hierarchical, dynamic topic-based lan- 
guage models. It proposes and evaluates new clus- 
ter generation, hierarchical smoothing and adaptive 
topic-probability estimation techniques. These com- 
bined models help capture long-distance lexical de- 
pendencies. Experiments on the Broadcast News 
corpus show significant improvement in perplexity 
(10.5% overall and 33.5% on target vocabulary). 

1 Introduction 

Statistical language models are core components of
speech recognizers, optical character recognizers and
even some machine translation systems (Brown et
al. (1990)). The most common language modeling
paradigm used today is based on n-grams, local
word sequences. These models make a Markovian
assumption on word dependencies, usually that word
predictions depend on at most m − 1 previous words.
Therefore they offer the following approximation for
the computation of a word sequence probability:

$$P(w_1^N) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-m+1}^{i-1})$$

where w_i^j denotes the sequence w_i ... w_j; a common
size for m is 3 (trigram language models).
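As a rough illustration of the n-gram approximation above, the following minimal sketch scores a word sequence with an unsmoothed trigram model (m = 3). The function names, the `<s>` padding symbol and the toy data are assumptions made for this example only; a real model would need smoothing, since any unseen trigram gets zero probability here.

```python
import math
from collections import Counter

def train_trigram(tokens):
    """Collect trigram counts and the counts of their two-word histories."""
    padded = ["<s>", "<s>"] + tokens  # hypothetical start-of-text padding
    tri = Counter(zip(padded, padded[1:], padded[2:]))
    hist = Counter()
    for (a, b, _), n in tri.items():
        hist[(a, b)] += n
    return tri, hist

def sequence_logprob(tokens, tri, hist):
    """Sum of log P(w_i | w_{i-2}, w_{i-1}) using relative frequencies."""
    padded = ["<s>", "<s>"] + tokens
    total = 0.0
    for a, b, c in zip(padded, padded[1:], padded[2:]):
        # Unsmoothed estimate: fails (log of zero) on unseen trigrams.
        total += math.log(tri[(a, b, c)] / hist[(a, b)])
    return total

tokens = ["a", "b", "a", "b", "a", "c"]
tri, hist = train_trigram(tokens)
logp = sequence_logprob(tokens, tri, hist)
```

Each word is predicted from its two predecessors only, which is exactly the short-context limitation that the topic-based models in this paper aim to relax.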

Even though n-grams have proved to be very powerful
and robust in various tasks involving language
models, they have a certain handicap: because of
the Markov assumption, the dependency is limited
to a very short local context. Cache language models
(Kuhn and de Mori (1992), Rosenfeld (1994)) try to
overcome this limitation by boosting the probability
of the words already seen in the history; trigger
models (Lau et al. (1993)), even more general, try to
capture the interrelationships between words. Models
based on syntactic structure (Chelba and Jelinek
(1998), Wright et al. (1993)) effectively estimate
intra-sentence syntactic word dependencies.

The approach we present here is based on the 
observation that certain words tend to have differ- 
ent probability distributions in different topics. We 
propose to compute the conditional language model 
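probability as a dynamic mixture model of K
topic-specific language models:

$$P(w_i \mid w_1^{i-1}) = \sum_{t=1}^{K} P(t \mid w_1^{i-1}) \cdot P_t(w_i \mid w_{i-m+1}^{i-1}) \tag{1}$$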
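The dynamic mixture described above can be sketched in a few lines. This is only an illustration: the function name and the numbers are invented for the example, and the topic posteriors and topic-specific probabilities are assumed to be supplied by components described later in the paper.

```python
def mixture_probability(topic_posteriors, topic_word_probs):
    """Mix K topic-specific word probabilities with topic posterior weights.

    topic_posteriors[t]  ~ P(t | full history)
    topic_word_probs[t]  ~ P_t(word | local n-gram context)
    """
    if len(topic_posteriors) != len(topic_word_probs):
        raise ValueError("one word probability is needed per topic")
    if abs(sum(topic_posteriors) - 1.0) > 1e-9:
        raise ValueError("topic posteriors must sum to 1")
    return sum(p_t * p_w for p_t, p_w in zip(topic_posteriors, topic_word_probs))

# Hypothetical values for three topics.
posteriors = [0.7, 0.2, 0.1]
word_probs = [0.012, 0.001, 0.0005]
p_word = mixture_probability(posteriors, word_probs)
```

The long-distance information enters only through the posterior weights, while each topic model still conditions on a short local context.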



[Figure: "Empirical Observation: Lexical Probabilities are Sensitive to Topic and Subtopic"; plot of P(peace | subtopic)]

Figure 1: Conditional probability of the word peace
given manually assigned Broadcast News topics
given manually assigned Broadcast News topics 
The motivation for developing topic-sensitive language
models is twofold. First, empirically speaking,
many n-gram probabilities vary substantially when
conditioned on topic (such as in the case of content
words following several function words). A more
important benefit, however, is that even when a given
bigram or trigram probability is not topic sensitive,
as in the case of sparse n-gram statistics, the
topic-sensitive unigram or bigram probabilities may
constitute a more informative backoff estimate than the
single global unigram or bigram estimates. Discussion
of these important smoothing issues is given in
the section on language model construction.

Finally, we observe that lexical probability
distributions vary not only with topic but with subtopic
too, in a hierarchical manner. For example, consider
the variation of the probability of the word
peace given major news topic distinctions (e.g.
BUSINESS and INTERNATIONAL news) as illustrated in
Figure 1. There is substantial subtopic probability
variation for peace within INTERNATIONAL
news (the word usage is 50 times more likely
in INTERNATIONAL:MIDDLE-EAST than
INTERNATIONAL:JAPAN). We propose methods of hierarchical
smoothing of P(w_i | topic_t) in a topic-tree to capture
this subtopic variation robustly.

1.1 Related Work 

Recently, the speech community has begun to address
the issue of topic in language modeling. Lowe
(1995) utilized the hand-assigned topic labels for
the Switchboard speech corpus to develop
topic-specific language models for each of the 42
Switchboard topics, and used a single topic-dependent
language model to rescore the lists of N-best
hypotheses. Error-rate improvement over the baseline
language model of 0.44% was reported.



Iyer et al. (1994) used bottom-up clustering
techniques on discourse contexts, performing
sentence-level model interpolation with weights updated
dynamically through an EM-like procedure. Evaluation
on the Wall Street Journal (WSJ0) corpus
showed a 4% perplexity reduction and 7% word error
rate reduction. In Iyer and Ostendorf (1996),
the model was improved by model probability
reestimation and interpolation with a cache model,
resulting in better dynamic adaptation and an overall
22%/3% perplexity/error rate reduction due to both
components.



Seymore and Rosenfeld (1997) reported significant
improvements when using a topic detector to build
specialized language models on the Broadcast News
(BN) corpus. They used TF-IDF and Naive Bayes
classifiers to detect the most similar topics to a given
article and then built a specialized language model
to rescore the N-best lists corresponding to the
article (yielding an overall 15% perplexity reduction
using document-specific parameter re-estimation, and
no significant word error rate reduction). Seymore
et al. (1998) split the vocabulary into 3 sets:
general words, on-topic words and off-topic words, and
then used a non-linear interpolation to compute the
language model. This yielded an 8% perplexity
reduction and 1% relative word error rate reduction.



In collaborative work, Mangu (1997) investigated
the benefits of using an existing Broadcast News
topic hierarchy extracted from topic labels as a
basis for language model computation. Manual tree
construction and hierarchical interpolation yielded
a 16% perplexity reduction over a baseline
unigram model. In a concurrent collaborative effort,
Khudanpur and Wu (1999) implemented clustering
and topic-detection techniques similar to those
presented here and computed a maximum entropy topic
sensitive language model for the Switchboard
corpus, yielding 8% perplexity reduction and 1.8% word
error rate reduction relative to a baseline maximum
entropy trigram model.



2 The Data 

The data used in this research is the Broadcast News 
(BN94) corpus, consisting of radio and TV news 
transcripts from the year 1994. From the total of
30226 documents, 20226 were used for training and 
the other 10000 were used as test and held-out data. 
The vocabulary size is approximately 120k words. 

3 Optimizing Document Clustering 
for Language Modeling 

For the purpose of language modeling, the topic la- 
bels assigned to a document or segment of a doc- 
ument can be obtained either manually (by topic- 
tagging the documents) or automatically, by using 
an unsupervised algorithm to group similar docu- 
ments in topic-like clusters. We have utilized the 
latter approach, for its generality and extensibility, 
and because there is no reason to believe that the 
manually assigned topics are optimal for language 
modeling. 

3.1 Tree Generation 

In this study, we have investigated a range of
hierarchical clustering techniques, examining extensions of
hierarchical agglomerative clustering, k-means
clustering and top-down EM-based clustering. The
latter underperformed on evaluations in Florian (1998)
and is not reported here.

A generic hierarchical agglomerative clustering
algorithm proceeds as follows: initially each document
has its own cluster. Repeatedly, the two closest
clusters are merged and replaced by their union, until
there is only one top-level cluster. Pairwise
document similarity may be based on a range of
functions, but to facilitate comparative analysis we have
utilized standard cosine similarity

$$d(D_1, D_2) = \frac{\langle D_1, D_2 \rangle}{\|D_1\| \, \|D_2\|}$$

and IR-style term vectors (see Salton and McGill (1983)).

This procedure outputs a tree in which documents
on similar topics (indicated by similar term content)
tend to be clustered together. The difference
between average-linkage and maximum-linkage
algorithms manifests in the way the similarity between
clusters is computed (see Duda and Hart (1973)). A
problem that appears when using hierarchical
clustering is that small centroids tend to cluster with
bigger centroids instead of other small centroids,
often resulting in highly skewed trees such as shown
in Figure 2, α = 0. To overcome the problem, we
devised two alternative approaches for computing the
intercluster similarity:

• Our first solution minimizes the attraction of
  large clusters by introducing a normalizing
  factor α to the inter-cluster distance function:

  $$d(C_1, C_2) = \frac{\langle c(C_1), c(C_2) \rangle}{N(C_1)^{\alpha} \, \|c(C_1)\| \, N(C_2)^{\alpha} \, \|c(C_2)\|} \tag{2}$$

  where N(C_k) is the number of vectors (documents)
  in cluster C_k and c(C_i) is the centroid
  of the i-th cluster. Increasing α improves tree
  balance as shown in Figure 2, but as α becomes
  large the forced balancing degrades cluster
  quality.

• A second approach we explored is to perform
  basic smoothing of term vector weights, replacing
  all 0's with a small value ε. By decreasing
  initial vector orthogonality, this approach
  facilitates attraction to small centroids, and leads to
  more balanced clusters as shown in Figure 3.

[Figure: cluster trees for increasing α, including α = 0.3 and α = 0.5]

Figure 2: As α increases, the trees become more
balanced, at the expense of forced clustering

[Figure: cluster trees for increasing ε, including ε = 0.15, ε = 0.3 and ε = 0.7]

Figure 3: Tree-balance is also sensitive to the
smoothing parameter ε.
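The two balancing devices above can be sketched as follows. This is a minimal illustration using plain Python lists as term vectors; the function names and the toy vectors are assumptions for the example, and no claim is made about how the original system stored or weighted its vectors.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def norm(vector):
    return math.sqrt(sum(x * x for x in vector))

def normalized_similarity(cluster1, cluster2, alpha):
    """Cosine similarity of centroids, damped by cluster sizes raised to alpha."""
    c1, c2 = centroid(cluster1), centroid(cluster2)
    dot = sum(a * b for a, b in zip(c1, c2))
    denominator = (len(cluster1) ** alpha) * norm(c1) * (len(cluster2) ** alpha) * norm(c2)
    return dot / denominator

def smooth(vector, eps):
    """Replace every zero weight with a small value eps."""
    return [x if x != 0 else eps for x in vector]

big = [[1.0, 0.0], [1.0, 0.0]]   # two-document cluster
small = [[1.0, 1.0]]             # single-document cluster
plain = normalized_similarity(big, small, alpha=0.0)    # ordinary cosine
damped = normalized_similarity(big, small, alpha=0.5)   # large cluster penalized

# Orthogonal vectors gain a small non-zero similarity after smoothing.
u, v = smooth([1.0, 0.0], 0.1), smooth([0.0, 1.0], 0.1)
smoothed = normalized_similarity([u], [v], alpha=0.0)
```

With α = 0 the measure reduces to the ordinary cosine, so α acts as a single knob that trades similarity against cluster size.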

Instead of stopping the process when the desired
number of clusters is obtained, we generate the full
tree for two reasons: (1) the full hierarchical
structure is exploited in our language models and (2) once
the tree structure is generated, the objective
function we used to partition the tree differs from that
used when building the tree. Since the clustering
procedure turns out to be rather expensive for large
datasets (both in terms of time and memory), only
10000 documents were used for generating the initial
hierarchical structure.

Footnote: the choice of the optimum α is described in the
discussion of optimizing the hierarchical structure.



3.2 Optimizing the Hierarchical Structure 

To be able to compute accurate language models,
one has to have sufficient data for the relative
frequency estimates to be reliable. Usually, even with
enough data, a smoothing scheme is employed to
ensure that P > 0 for any given word sequence.

The trees obtained from the previous step have
documents in the leaves, therefore not enough word
mass for proper probability estimation. But, on the
path from a leaf to the root, the internal nodes grow
in mass, ending with the root where the counts from
the entire corpus are stored. Since our intention is to
use the full tree structure to interpolate between the
in-node language models, we proceeded to identify
a subset of internal nodes of the tree, which contain
sufficient data for language model estimation. The
criterion for choosing the nodes to collapse involves
a goodness function, such that the cut is a
solution to a constrained optimization problem, given
the constraint that the resulting tree has exactly k
leaves. Let this evaluation function be g(n), where
n is a node of the tree, and suppose that we want
to minimize it. Let g(n, k) be the minimum cost of
creating k leaves in the subtree of root n. When the
evaluation function g(n) satisfies the locality
condition that it depends solely on the values g(n_j, ·)
(where (n_j)_{j=1..k} are the children of node n), g(root, k)
can be computed efficiently using dynamic
programming:

$$g(n, 1) = g(n)$$

$$g(n, k) = \min_{j_1, \ldots, j_k \geq 1} h\left(g(n_1, j_1), \ldots, g(n_k, j_k)\right) \tag{3}$$
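The dynamic program above can be sketched for the special case where the combining operator h is a sum, which is one of the choices the paper mentions. The `Node` class, the function name and the toy costs are hypothetical; the sketch returns only the minimum costs, not the cut itself.

```python
INF = float("inf")

class Node:
    """Tree node carrying the cost g(n) of collapsing its subtree into one leaf."""
    def __init__(self, cost, children=()):
        self.cost = cost
        self.children = list(children)

def best_cut_costs(node, k_max):
    """Return best, where best[k] is the minimum summed cost of cutting the
    subtree rooted at `node` into exactly k leaves (INF if impossible)."""
    best = [INF] * (k_max + 1)
    best[1] = node.cost                      # collapse the whole subtree
    if not node.children:
        return best
    # combined[j]: cheapest way to obtain j leaves from the children seen so far
    combined = [0.0] + [INF] * k_max
    for child in node.children:
        child_best = best_cut_costs(child, k_max)
        merged = [INF] * (k_max + 1)
        for used in range(k_max + 1):
            if combined[used] == INF:
                continue
            for j in range(1, k_max - used + 1):   # every child keeps >= 1 leaf
                candidate = combined[used] + child_best[j]
                if candidate < merged[used + j]:
                    merged[used + j] = candidate
        combined = merged
    for k in range(2, k_max + 1):
        best[k] = combined[k]
    return best

# Toy tree: root -> (A -> a1, a2), B
a = Node(3.0, [Node(1.0), Node(1.0)])
b = Node(4.0)
root = Node(10.0, [a, b])
costs = best_cut_costs(root, 3)
```

Because each node only needs the cost tables of its children, the locality condition is what makes this bottom-up computation valid.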

Let us assume for a moment that we are interested
in computing a unigram topic-mixture language
model. If the topic-conditional distributions
have high entropy (e.g. the histogram of P(w | topic)
is fairly uniform), topic-sensitive language model
interpolation will not yield any improvement, no
matter how well the topic detection procedure works.
Therefore, we are interested in clustering documents
in such a way that the topic-conditional distribution
P(w | topic) is maximally skewed. With this in mind,
we selected the evaluation function to be the
conditional entropy of a set of words (possibly the whole
vocabulary) given the particular classification. The
conditional entropy of some set of words W given a
partition C is

$$H(W \mid C) = -\sum_{i=1}^{n} P(C_i) \sum_{w \in W} P(w \mid C_i) \cdot \log P(w \mid C_i)$$

$$= -\frac{1}{T} \sum_{i=1}^{n} \sum_{w \in W} c(w, C_i) \cdot \log P(w \mid C_i) \tag{4}$$
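A small sketch of this objective follows. As a simplifying assumption, raw word counts stand in for the TF-IDF factors c(w, C_i), and P(w | C_i) is taken to be the relative frequency within the cluster; the function name and the toy clusters are invented for the example.

```python
import math

def conditional_entropy(cluster_counts):
    """H(W | C) in nats for a list of per-cluster {word: count} dictionaries.

    Assumption: counts play the role of c(w, C_i), and T is their grand total.
    """
    total = sum(sum(counts.values()) for counts in cluster_counts)
    entropy = 0.0
    for counts in cluster_counts:
        mass = sum(counts.values())
        for count in counts.values():
            if count > 0:
                entropy -= count * math.log(count / mass)
    return entropy / total

# Perfectly skewed clusters: each word lives in exactly one cluster.
skewed = [{"peace": 4}, {"stock": 4}]
# Uninformative clusters: both words are equally likely everywhere.
mixed = [{"peace": 2, "stock": 2}, {"peace": 2, "stock": 2}]

h_skewed = conditional_entropy(skewed)
h_mixed = conditional_entropy(mixed)
```

The skewed partition reaches the minimum value of zero, while the uninformative one reaches log 2, which matches the intuition that useful topic clusters concentrate each word's mass.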

Footnote: the cut is the collection of nodes that collapse.

Footnote: h is an operator through which the values
g(n_1, j_1), ..., g(n_k, j_k) are combined, such as Σ or ∏.

[Figure: two panels, "Conditional Entropy in the Average-Linkage Case" and "Conditional Entropy in the Maximum-Linkage Case"]

Figure 4: Conditional entropy for different α, cluster sizes and linkage methods



where c(w, C_i) is the TF-IDF factor of word w in
class Ci and T is the size of the corpus. Let us 
observe that the conditional entropy does satisfy the 
locality condition mentioned earlier. 

Given this objective function, we identified the
optimal tree cut using the dynamic-programming
technique described above. We also optimized different
parameters (such as α and choice of linkage method).

Figure 4 illustrates that for a range of cluster sizes,
maximal linkage clustering with α = 0.15-0.3 yields
optimal performance given the objective function in
equation (4).

The effect of varying α is also shown graphically in
Figure 5. Successful tree construction for language
modeling purposes will minimize the conditional
entropy of P(W | C). This is most clearly illustrated
for the word politics, where the tree generated with
α = 0.3 maximally focuses documents on this topic
into a single cluster. The other words shown also
exhibit this desirable highly skewed distribution of
P(W | C) in the cluster tree generated when α = 0.3.

Another investigated approach was k-means
clustering (see Duda and Hart (1973)) as a robust and
proven alternative to hierarchical clustering. Its
application, with both our automatically derived
clusters and Mangu's manually derived clusters (Mangu
(1997)) used as initial partitions, actually yielded a
small increase in conditional entropy and was not
pursued further.

4 Language Model Construction and 
Evaluation 

Estimating the language model probabilities is a
two-phase process. First, the topic-sensitive
language model probabilities P(w_i | t, w_{i-m+1}^{i-1}) are
computed during the training phase. Then, at run-time,
or in the testing phase, topic is dynamically
identified by computing the probabilities P(t | w_1^{i-1}) as
in Section 4.2 and the final language model
probabilities are computed using Equation (1), the
topic-mixture model of Section 1. The tree
used in the following experiments was generated
using average-linkage agglomerative clustering, using
parameters that optimize the conditional-entropy
objective function used to optimize the hierarchical
structure.

4.1 Language Model Construction 

The topic-specific language model probabilities are 
computed in a four phase process: 

1. Each document is assigned to one leaf in the 
tree, based on the similarity to the leaves' cen- 
troids (using the cosine similarity). The doc- 
ument counts are added to the selected leaf's 
count. 

2. The leaf counts are propagated up the tree such 
that, in the end, the counts of every inter- 
nal node are equal to the sum of its children's 
counts. At this stage, each node of the tree has 
an attached language model - the relative fre- 
quencies. 

3. In the root of the tree, a discounted
   Good-Turing language model is computed (see Katz
   (1987), Chen and Goodman (1998)).

4. m-gram smooth language models are computed
   for each node n different than the root by
   three-way interpolating between the m-gram
   language model in the parent parent(n), the
   (m − 1)-gram smooth language model in node
   n and the m-gram relative frequency estimate
   in node n:

   $$P_n(w_m \mid w_1^{m-1}) = \lambda_n^1(w_1^{m-1}) \, P_{parent(n)}(w_m \mid w_1^{m-1}) + \lambda_n^2(w_1^{m-1}) \, P_n(w_m \mid w_2^{m-1}) + \lambda_n^3(w_1^{m-1}) \, f_n(w_m \mid w_1^{m-1}) \tag{5}$$

   with λ_n^1(w_1^{m-1}) + λ_n^2(w_1^{m-1}) + λ_n^3(w_1^{m-1}) = 1
   for each node n in the tree. Based on how
   λ_n^k(w_1^{m-1}) depend on the particular node n and
   the word history w_1^{m-1}, various models can be
   obtained. We investigated two approaches: a
   bigram model in which the λ's are fixed over
   the tree, and a more general trigram model in
   which the λ's adapt using an EM reestimation
   procedure.

• Case 1: f_node(w_1) ≠ 0

$$P_{node}(w_2 \mid w_1) = \begin{cases} P_{root}(w_2 \mid w_1) & \text{if } w_2 \in \mathcal{F}(w_1) \\ \lambda_1 f_{node}(w_2 \mid w_1) \cdot \gamma_{node}(w_1) + \lambda_2 P_{node}(w_2) + (1 - \lambda_1 - \lambda_2) P_{parent(node)}(w_2 \mid w_1) & \text{if } w_2 \in \mathcal{R}(w_1) \\ \alpha_{node}(w_1) \, P_{node}(w_2) & \text{if } w_2 \in \mathcal{U}(w_1) \end{cases}$$

where γ_node(w_1) and α_node(w_1) are normalization factors that involve the ratio β and are computed such that the probabilities sum to 1.

• Case 2: f_node(w_1) = 0

$$P_{node}(w_2 \mid w_1) = \begin{cases} P_{root}(w_2 \mid w_1) & \text{if } w_2 \in \mathcal{F}(w_1) \\ \lambda_2 P_{node}(w_2) \cdot \gamma_{node}(w_1) + (1 - \lambda_2) P_{parent(node)}(w_2 \mid w_1) & \text{if } w_2 \in \mathcal{R}(w_1) \\ \alpha_{node}(w_1) \, P_{node}(w_2) & \text{if } w_2 \in \mathcal{U}(w_1) \end{cases}$$

where γ_node(w_1) and α_node(w_1) are computed in a similar fashion such that the probabilities do sum to 1.

Figure 5: Basic Bigram Language Model Specifications
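The construction steps can be illustrated with a deliberately simplified sketch. It covers count propagation (step 2) and a top-down interpolation with the parent model, but as stated assumptions it uses unigrams only, a plain relative-frequency model at the root instead of a discounted Good-Turing model, and a single fixed weight `lam` in place of the three node- and history-dependent weights. All names are hypothetical.

```python
from collections import Counter

class TopicNode:
    def __init__(self, counts=None, children=()):
        self.counts = Counter(counts or {})
        self.children = list(children)
        self.model = {}

def propagate_counts(node):
    """Make every internal node's counts the sum of its children's counts."""
    for child in node.children:
        node.counts += propagate_counts(child)
    return node.counts

def smooth_down(node, parent_model=None, lam=0.5):
    """Interpolate each node's relative frequencies with its parent's model."""
    total = sum(node.counts.values())
    relative = {w: c / total for w, c in node.counts.items()} if total else {}
    if parent_model is None:
        node.model = relative                  # root: unsmoothed in this sketch
    elif not relative:
        node.model = dict(parent_model)        # empty node: fall back to parent
    else:
        # The parent's vocabulary contains the node's, since counts were summed.
        node.model = {w: lam * relative.get(w, 0.0) + (1 - lam) * p
                      for w, p in parent_model.items()}
    for child in node.children:
        smooth_down(child, node.model, lam)

leaf1 = TopicNode({"peace": 3, "war": 1})
leaf2 = TopicNode({"stock": 4})
root = TopicNode(children=[leaf1, leaf2])
propagate_counts(root)
smooth_down(root, lam=0.5)
```

Even in this reduced form, a leaf assigns non-zero probability to words it never saw (here "stock" in `leaf1`) by borrowing mass from its better-estimated ancestors.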

4.1.1 Bigram Language Model 



Not all words are topic sensitive. Mangu (1997)
observed that closed-class function words (FW), such
as the, of, and with, have minimal probability
variation across different topic parameterizations, while
most open-class content words (CW) exhibit
substantial topic variation. This leads us to divide the
possible word pairs in two classes (topic-sensitive
and not) and compute the λ's in Equation (5) in
such a way that the probabilities in the latter
(topic-insensitive) set are constant in all the models.
To formalize this:

• F(w_1) = {w_2 ∈ V | (w_1, w_2) is fixed}, the
  "fixed" space;

• R(w_1) = {w_2 ∈ V | (w_1, w_2) is free/variable},
  the "free" space;

• U(w_1) = {w_2 ∈ V | (w_1, w_2) was never seen},
  the "unknown" space.

The imposed restriction is, then: for every word
w_1 and any word w_2 ∈ F(w_1), P_n(w_2 | w_1) =
P_root(w_2 | w_1) in any node n.

The distribution of bigrams in the training data
is as follows, with roughly 30% bigram probabilities
allowed to vary in the topic-sensitive models:

Model   Bigram-type   Example          Freq.   Topic sensitivity
fixed   p(FW|FW)      p(the|in)        45.3%   least topic sensitive
fixed   p(FW|CW)      p(of|scenario)   24.8%   ↓
free    p(CW|CW)      p(air|cold)       5.3%   ↓
free    p(CW|FW)      p(air|the)       24.5%   most topic sensitive

This approach raises one interesting issue: the
language model in the root assigns some probability
mass to the unseen events, equal to the singletons'
mass (see Good (1953), Katz (1987)). In our
case, based on the assumptions made in the
Good-Turing formulation, we considered that the ratio of
the probability mass that goes to the unseen events
and the one that goes to seen, free events should be
fixed over the nodes of the tree. Let β be this ratio.
Then the language model probabilities are computed
as in Figure 5.

4.1.2 Ngram Language Model Smoothing 

In general, n-gram language model probabilities
can be computed as in formula (5), where the weights
λ_n^k(w_1^{m-1}), k = 1, 2, 3, are adapted both for the
particular node n and history w_1^{m-1}. The proposed
dependency on the history is realized through the
history count C(w_1^{m-1}) and the relevance of the history
w_1^{m-1} to the topic in the nodes n and parent(n).
The intuition is that if a history is as relevant in the
current node as in the parent, then the estimates in
the parent should be given more importance, since
they are better estimated. On the other hand, if the
history is much more relevant in the current node,
then the estimates in the node should be trusted
more. The mean adapted λ for a given height h
in the tree is shown in Figure 6. This is consistent
with the observation that splits in the middle of the
tree tend to be most informative, while those closer
to the leaves suffer from data fragmentation, and
hence give relatively more weight to their parent.
As before, since not all the m-grams are expected to
be topic-sensitive, we use a method to ensure that
those m-grams are kept "fixed" to minimize noise
and modeling effort. In this case, though, 2
language models with different support are used: one
that supports the topic-insensitive m-grams and that
is computed only once (it is a normalization of the
topic-insensitive part of the overall model), and one
that supports the rest of the mass and which is
computed by interpolation using formula (5). Finally,
the final language model in each node is computed
as a mixture of the two.

[Figure: topic distribution and topic-conditional probabilities for the context "It is at least on the Serb side a real setback to the | ? |"]

Figure 7: Topic sensitive probability estimation for peace and piece in context

Figure 6: Mean of the estimated λs at node height
h, in the unigram case

4.2 Dynamic Topic Adaptation 

Consider the example of predicting the word following
the Broadcast News fragment: "It is at least on
the Serb side a real drawback to the | ? |". Our topic
detection model, as further detailed later in this
section, assigns a topic distribution to this left context
(including the full previous discourse), illustrated in
the upper portion of Figure 7. The model identifies
that this particular context has greatest affinity
with the empirically generated topic clusters #41
and #42 (which appear to have one of their foci on
international events).

The lower portion of Figure 7 illustrates the
topic-conditional bigram probabilities P(w | the, topic) for
two candidate hypotheses for w: peace (the actually
observed word in this case) and piece (an
incorrect competing hypothesis). In the former case,
P(peace | the, topic) is clearly highly elevated in the
most probable topics for this context (#41, #42),
and thus the application of our core model
combination (Equation (1)) yields a posterior joint product

$$P(w_i \mid w_1^{i-1}) = \sum_{t=1}^{K} P(t \mid w_1^{i-1}) \cdot P_t(w_i \mid w_{i-m+1}^{i-1})$$

that is 12 times more likely than the overall bigram
probability, P(air|the) = 0.001. In contrast, the obvious
acoustically motivated alternative piece has greatest
probability in a far different and much more diffuse
distribution of topics, yielding a joint model
probability for this particular context that is 40%
lower than its baseline bigram probability. This
context-sensitive adaptation illustrates the efficacy
of dynamic topic adaptation in increasing the model
probability of the truth.

Clearly the process of computing the topic
detector P(t | w_1^{i-1}) is crucial. We have investigated
several mechanisms for estimating this probability;
the most promising is a class of normalized
transformations of traditional cosine similarity between
the document history vector w_1^{i-1} and the topic
centroids:

$$P(t \mid w_1^{i-1}) = \frac{f\left(\text{Cosine-Sim}(t, w_1^{i-1})\right)}{\sum_{t'} f\left(\text{Cosine-Sim}(t', w_1^{i-1})\right)} \tag{6}$$
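The normalized transformation above amounts to applying f to each topic's similarity and renormalizing. A minimal sketch follows; the function name and the similarity values are assumptions for illustration, and the similarities themselves are taken as given.

```python
def topic_posteriors(similarities, f=lambda x: x):
    """Turn per-topic cosine similarities into a distribution over topics."""
    weights = [f(s) for s in similarities]
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one topic must receive positive weight")
    return [w / total for w in weights]

sims = [0.8, 0.4, 0.2]                                  # hypothetical similarities
linear = topic_posteriors(sims)                         # f is the identity
squared = topic_posteriors(sims, f=lambda x: x * x)     # f(x) = x * g(x), g(x) = x
```

With these toy values the closest topic receives about 57% of the mass under the identity and about 76% under the squared transformation, which shows how a steeper f concentrates the posterior on the best-matching topics.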



One obvious choice for the function f would be the
identity. However, considering a linear contribution
of similarities poses a problem: because topic
detection is more accurate when the history is long,
even unrelated topics will have a non-trivial
contribution to the final probability, resulting in poorer
estimates.

Language model                                       Perplexity on the    Perplexity on the
                                                     entire vocabulary    target vocabulary
Standard Bigram Model                                215                  584
Topic LMs:
  History size   Scaled   g(x)   f(x)    k-NN
  100            yes      x      x^2     -           206                  460
  1000           yes      x      x^2     -           195                  405
  5000           yes*     x*     x^2*    -           192 (-10%)           389 (-33%)
  5000           yes      1      x       -           202                  444
  5000           no       x      x^2     -           193                  394
  5000           yes      x      x^2     15-NN       192                  390
  5000           yes      e^x    x e^x   -           196                  411

Table 1: Perplexity results for topic sensitive bigram language model, different history lengths

One class of transformations we investigated, which
directly addresses the previous problem, adjusts the
similarities such that closer topics weigh more and
more distant ones weigh less. Therefore, f is chosen
such that

$$\frac{f(x_1)}{x_1} \leq \frac{f(x_2)}{x_2} \quad \text{for } x_1 \leq x_2 \tag{7}$$

that is, f(x)/x should be a monotonically increasing
function on the interval [0, 1], or, equivalently,
f(x) = x · g(x), g being an increasing function on
[0, 1]. Choices for g(x) include x, x^γ (γ > 0), log(x), e^x.

Another way of solving this problem is through the
scaling operator, which subtracts the minimum
similarity from each value: x_i → x_i − min_j x_j. By applying
this operator, minimum values (corresponding to
low-relevancy topics) do not receive any mass at all,
and the mass is divided between the more relevant
topics. For example, a combination of scaling and
g(x) = x^γ yields:

$$P(t_i \mid w_1^{i-1}) = \frac{\left(\text{Sim}(w_1^{i-1}, t_i) - \min_k \text{Sim}(w_1^{i-1}, t_k)\right)^{1+\gamma}}{\sum_k \left(\text{Sim}(w_1^{i-1}, t_k) - \min_{k'} \text{Sim}(w_1^{i-1}, t_{k'})\right)^{1+\gamma}} \tag{8}$$

A third class of transformations we investigated
considers only the closest k topics in formula (6)
and ignores the more distant topics.
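The scaling and closest-k variants can be sketched as small list transformations applied before normalization. The helper names and values below are hypothetical, and the scaling is written as a plain subtraction of the minimum, which is an assumption consistent with the statement that minimum values receive no mass.

```python
def scale_similarities(similarities):
    """Subtract the minimum so the least relevant topic gets zero weight."""
    low = min(similarities)
    return [s - low for s in similarities]

def keep_top_k(similarities, k):
    """Zero out every topic that is not among the k closest (ties are kept)."""
    threshold = sorted(similarities, reverse=True)[k - 1]
    return [s if s >= threshold else 0.0 for s in similarities]

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

sims = [0.8, 0.4, 0.2]   # hypothetical cosine similarities to three topics

# Scaling followed by f(x) = x * g(x) with g(x) = x.
scaled_posteriors = normalize([s * s for s in scale_similarities(sims)])

# Closest-2 topics with the identity transformation.
top2_posteriors = normalize(keep_top_k(sims, 2))
```

Both variants remove mass from distant topics entirely, instead of merely down-weighting them as the monotone transformations do.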

4.3 Language Model Evaluation 

Table 1 briefly summarizes a larger table of
performance measured on the bigram implementation
of this adaptive topic-based LM. For the default
parameters (indicated by *), a statistically
significant overall perplexity decrease of 10.5% was
observed relative to a standard bigram model
measured on the same 1000 test documents.
Systematically modifying these parameters, we note that
performance is decreased by using shorter discourse
contexts (as histories never cross discourse
boundaries, 5000-word histories essentially correspond to
the full prior discourse). Keeping other parameters
constant, g(x) = x outperforms other candidate
transformations g(x) = 1 and g(x) = e^x. Absence
of k-NN and use of scaling both yield minor
performance improvements.

Footnote (to the discussion of linear similarity
contributions in Section 4.2): the non-trivial contribution of
unrelated topics is due to unimportant word co-occurrences.

It is important to note that for 66% of the
vocabulary the topic-based LM is identical to the core
bigram model. On the 34% of the data that falls in
the model's target vocabulary, however, perplexity
reduction is a much more substantial 33.5%
improvement. The ability to isolate a well-defined target
subtask and perform very well on it makes this work
especially promising for use in model combination.
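For readers who want to relate the reported percentages to the perplexity figures, the following sketch shows the standard definition of perplexity and the relative reduction computed from the rounded values in Table 1. The function names are hypothetical; the small differences from the reported 10.5% and 33.5% are presumably due to rounding of the tabulated perplexities.

```python
import math

def perplexity(word_probs):
    """Perplexity of a test sequence given the model probability of each word."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))

def relative_reduction(baseline, adapted):
    """Fractional perplexity reduction of the adapted model over the baseline."""
    return (baseline - adapted) / baseline

uniform = perplexity([0.25] * 4)             # uniform choice among four words
overall = relative_reduction(215, 192)       # entire vocabulary, Table 1
target = relative_reduction(584, 389)        # target vocabulary, Table 1
```

A model that assigns probability 0.25 to every word has perplexity 4, and the rounded table entries give reductions of roughly 10.7% overall and 33.4% on the target vocabulary.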

5 Conclusion 

In this paper we described a novel method of
generating and applying hierarchical, dynamic
topic-based language models. Specifically, we have
proposed and evaluated hierarchical cluster generation
procedures that yield specially balanced and
pruned trees directly optimized for language
modeling purposes. We also presented a novel
hierarchical interpolation algorithm for generating a
language model from these trees, specializing in the
hierarchical topic-conditional probability estimation
for a target topic-sensitive vocabulary (34% of the
entire vocabulary). We also proposed and evaluated
a range of dynamic topic detection procedures
based on several transformations of content-vector
similarity measures. These dynamic estimations of
P(topic_i | history) are combined with the hierarchical
estimation of P(word_j | topic_i, history) in a product
across topics, yielding a final probability estimate
of P(word_j | history) that effectively captures
long-distance lexical dependencies via these intermediate
topic models. Statistically significant reductions in
perplexity are obtained relative to a baseline model,
both on the entire text (10.5%) and on the target
vocabulary (33.5%). This large improvement on a
readily isolatable subset of the data bodes well for
further model combination.